VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution
Most existing video face super-resolution (VFSR) methods are trained
and evaluated on VoxCeleb1, a dataset designed specifically for speaker
identification whose frames are of low quality. As a consequence, VFSR
models trained on this dataset cannot produce visually pleasing results.
In this paper, we develop an automatic and scalable pipeline to collect a
high-quality video face dataset (VFHQ), which contains a large number of
high-fidelity clips of diverse interview scenarios. To verify the
necessity of VFHQ, we further conduct experiments and demonstrate that VFSR
models trained on our VFHQ dataset can generate results with sharper edges and
finer textures than those trained on VoxCeleb1. In addition, we show that
temporal information plays a pivotal role in eliminating video consistency
issues and further improving visual quality. Based on VFHQ, we further
conduct a benchmarking study of several state-of-the-art algorithms under
bicubic and blind settings. See our project page:
https://liangbinxie.github.io/projects/vfhq
Review of 2D Animation Restoration in Visual Domain Based on Deep Learning
Traditional 2D animation is a distinct visual style whose production process and image characteristics differ significantly from real-life scenes. It is usually drawn frame by frame and stored as bitmaps. During storage, transmission, and playback, 2D animation may suffer from problems such as degraded picture quality, insufficient resolution, and temporal discontinuity. With the development of deep learning, the technology has been widely applied to animation restoration. This paper provides a comprehensive summary of deep-learning-based 2D animation restoration. First, we explore existing animation datasets to identify the data support currently available and the bottlenecks in building animation datasets. Second, we investigate and test deep-learning-based algorithms for animation image quality restoration and animation frame interpolation to identify the key issues and challenges in animation restoration. In addition, we introduce methods designed to ensure consistency between animation frames, which can provide insights for future animation video restoration. We also analyze the effectiveness of existing image quality assessment (IQA) methods on animation images to identify practical IQA methods that can guide restoration results. Finally, based on the above analysis, this paper clarifies the challenges in animation restoration tasks and outlines future directions for deep learning in this field.
Rethinking Alignment in Video Super-Resolution Transformers
The alignment of adjacent frames is considered an essential operation in
video super-resolution (VSR). Advanced VSR models, including the latest VSR
Transformers, are generally equipped with well-designed alignment modules.
However, advances in the self-attention mechanism may challenge this
conventional wisdom. In this paper, we rethink the role of alignment in VSR Transformers and
make several counter-intuitive observations. Our experiments show that: (i) VSR
Transformers can directly utilize multi-frame information from unaligned
videos, and (ii) existing alignment methods are sometimes harmful to VSR
Transformers. These observations indicate that we can further improve the
performance of VSR Transformers simply by removing the alignment module and
adopting a larger attention window. Nevertheless, such designs will
dramatically increase the computational burden, and cannot deal with large
motions. Therefore, we propose a new and efficient alignment method called
patch alignment, which aligns image patches instead of pixels. VSR Transformers
equipped with patch alignment achieve state-of-the-art performance on
multiple benchmarks. Our work provides valuable insights on how multi-frame
information is used in VSR and how to select alignment methods for different
networks/datasets. Codes and models will be released at
https://github.com/XPixelGroup/RethinkVSRAlignment.
Comment: Accepted at NeurIPS 2022.
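
As a rough illustration of the patch-alignment idea (one representative motion vector per patch instead of per-pixel warping), the sketch below shifts each patch of a neighbouring frame by the mean optical flow inside that patch. The mean-flow choice, the (dy, dx) flow layout, and frame sizes divisible by the patch size are assumptions for illustration, not the released implementation.

    # Simplified patch alignment: each patch of the supporting frame is moved
    # rigidly by one motion vector, which keeps its local structure intact.
    import numpy as np

    def patch_align(neighbor, flow, patch=8):
        """neighbor: (H, W, C) supporting frame; flow: (H, W, 2) flow towards
        the reference frame, stored as (dy, dx). Assumes H and W are multiples
        of the patch size."""
        H, W, _ = neighbor.shape
        aligned = np.zeros_like(neighbor)
        for y in range(0, H, patch):
            for x in range(0, W, patch):
                # One representative vector per patch (mean flow, rounded).
                dy, dx = flow[y:y + patch, x:x + patch].reshape(-1, 2).mean(0).round().astype(int)
                ys = int(np.clip(y + dy, 0, H - patch))
                xs = int(np.clip(x + dx, 0, W - patch))
                aligned[y:y + patch, x:x + patch] = neighbor[ys:ys + patch, xs:xs + patch]
        return aligned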
Enhanced Quadratic Video Interpolation
With the growth of the digital video industry, video frame interpolation has
attracted continuous attention in the computer vision community and seen a
surge of interest in industry. Many learning-based methods have been proposed
and have achieved promising results. Among them, a recent algorithm named quadratic
video interpolation (QVI) achieves appealing performance. It exploits
higher-order motion information (e.g. acceleration) and successfully models the
estimation of the interpolated flow. However, the intermediate frames it
produces still contain unsatisfactory ghosting, artifacts, and inaccurate motion,
especially when large and complex motion occurs. In this work, we further
improve the performance of QVI from three facets and propose an enhanced
quadratic video interpolation (EQVI) model. In particular, we adopt a rectified
quadratic flow prediction (RQFP) formulation with a least-squares method to
estimate the motion more accurately. Complementary to pixel-level image
blending, we introduce a residual contextual synthesis network (RCSN) to employ
contextual information in high-dimensional feature space, which could help the
model handle more complicated scenes and motion patterns. Moreover, to further
boost the performance, we devise a novel multi-scale fusion network (MS-Fusion)
which can be regarded as a learnable augmentation process. The proposed EQVI
model won the first place in the AIM2020 Video Temporal Super-Resolution
Challenge.
Comment: Winning solution of the AIM2020 VTSR Challenge (in conjunction with ECCV 2020).
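
The quadratic motion model behind QVI/RQFP expresses the flow from the reference frame towards time t as f(t) = v*t + 0.5*a*t^2, with per-pixel velocity v and acceleration a. The sketch below fits v and a by least squares from flows estimated towards several neighbouring frames; the function names, timestamps, and numpy-based formulation are illustrative assumptions, not the authors' implementation.

    # Minimal per-pixel least-squares fit of the quadratic flow model
    #     f(t) = v * t + 0.5 * a * t**2
    # used by QVI/EQVI-style interpolation (a sketch, not the paper's code).
    import numpy as np

    def fit_quadratic_flow(flows, times):
        """flows: list of (H, W, 2) flow maps from frame 0 to each timestamp.
        times: matching relative timestamps, e.g. [-1.0, 1.0, 2.0].
        Returns per-pixel velocity v and acceleration a, each (H, W, 2)."""
        t = np.asarray(times, dtype=np.float64)        # (N,)
        A = np.stack([t, 0.5 * t ** 2], axis=1)        # (N, 2) design matrix
        B = np.stack([f.reshape(-1) for f in flows])   # (N, H*W*2) observations
        sol, *_ = np.linalg.lstsq(A, B, rcond=None)    # (2, H*W*2)
        v, a = sol.reshape(2, *flows[0].shape)
        return v, a

    def interpolated_flow(v, a, t):
        """Flow from frame 0 towards an arbitrary intermediate time t."""
        return v * t + 0.5 * a * t ** 2

With only two timestamps [-1, 1] the fit reduces to the closed form of the original QVI, v = (f(1) - f(-1)) / 2 and a = f(1) + f(-1); the least-squares form simply generalises this to more observed flows.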
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
The incredible generative ability of large-scale text-to-image (T2I) models
has demonstrated strong power of learning complex structures and meaningful
semantics. However, relying solely on text prompts cannot fully take advantage
of the knowledge learned by the model, especially when flexible and accurate
structure control is needed. In this paper, we aim to "dig out" the
capabilities that T2I models have implicitly learned, and then explicitly use
them to control generation at a finer granularity. Specifically, we propose to
learn simple and small T2I-Adapters to align internal knowledge in T2I models
with external control signals, while freezing the original large T2I models. In
this way, we can train various adapters according to different conditions, and
achieve rich control and editing effects. Further, the proposed T2I-Adapters
have attractive properties of practical value, such as composability and
generalization ability. Extensive experiments demonstrate that our T2I-Adapter
has promising generation quality and a wide range of applications.
Comment: Tech report. GitHub: https://github.com/TencentARC/T2I-Adapter
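
To make the adapter idea concrete, the sketch below shows a small convolutional adapter that maps a spatial condition (e.g. a sketch or depth map) into multi-scale features, which would be added to the frozen T2I U-Net's encoder features at the matching scales. Module names, channel widths, and the additive injection are assumptions for illustration, not the released TencentARC/T2I-Adapter code.

    # A tiny adapter: condition image in, one feature map per encoder scale out.
    # Only the adapter is trained; the large T2I model stays frozen.
    import torch.nn as nn

    class TinyAdapter(nn.Module):
        def __init__(self, cond_channels=3, widths=(64, 128, 256, 256)):
            super().__init__()
            blocks, in_ch = [], cond_channels
            for w in widths:
                blocks.append(nn.Sequential(
                    nn.Conv2d(in_ch, w, 3, stride=2, padding=1),  # one downsample per scale
                    nn.SiLU(),
                    nn.Conv2d(w, w, 3, padding=1),
                ))
                in_ch = w
            self.blocks = nn.ModuleList(blocks)

        def forward(self, cond):
            feats, x = [], cond
            for blk in self.blocks:
                x = blk(x)
                feats.append(x)       # aligned with one U-Net encoder stage each
            return feats

    # During denoising, each adapter feature would be added to the matching
    # frozen encoder feature: unet_feat[i] = unet_feat[i] + adapter_feats[i]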
StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation
This paper presents a LoRA-free method for stylized image generation that
takes a text prompt and style reference images as inputs and produces an output
image in a single pass. Unlike existing methods that rely on training a
separate LoRA for each style, our method can adapt to various styles with a
unified model. However, this poses two challenges: 1) the prompt loses
controllability over the generated content, and 2) the output image inherits
both the semantic and style features of the style reference image, compromising
its content fidelity. To address these challenges, we introduce StyleAdapter, a
model that comprises two components: a two-path cross-attention module (TPCA)
and three decoupling strategies. These components enable our model to process
the prompt and style reference features separately and reduce the strong
coupling between the semantic and style information in the style references.
StyleAdapter can generate high-quality images that match the content of the
prompts and adopt the style of the references (even for unseen styles) in a
single pass, which is more flexible and efficient than previous methods.
Experiments have been conducted to demonstrate the superiority of our method
over previous works.
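
A rough sketch of a two-path cross-attention block in the spirit of the TPCA module described above: image latents attend separately to prompt features and to style-reference features, and the two results are fused with a learnable weight. The class name, dimensions, and fusion scheme are illustrative assumptions rather than the paper's exact design.

    # Two-path cross-attention: one path for content (text), one for style.
    import torch
    import torch.nn as nn

    class TwoPathCrossAttention(nn.Module):
        def __init__(self, dim=320, heads=8):
            super().__init__()
            self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.style_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gate = nn.Parameter(torch.tensor(1.0))  # learnable style strength

        def forward(self, latents, text_feats, style_feats):
            # Path 1: content control from the text prompt.
            text_out, _ = self.text_attn(latents, text_feats, text_feats)
            # Path 2: appearance control from the style-reference features.
            style_out, _ = self.style_attn(latents, style_feats, style_feats)
            return latents + text_out + self.gate * style_out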
Mitigating Artifacts in Real-World Video Super-resolution Models
The recurrent structure is a prevalent framework for video super-resolution, modeling the temporal dependency between frames via hidden states. When applied to real-world scenarios with unknown and complex degradations, hidden states tend to contain unpleasant artifacts and propagate them to the restored frames. Our analyses show that such artifacts can be largely alleviated when the hidden state is replaced with a cleaner counterpart. Based on this observation, we propose a Hidden State Attention (HSA) module to mitigate artifacts in real-world video super-resolution. Specifically, we first apply various cheap filters to produce a hidden state pool; for example, Gaussian blur filters smooth artifacts, while sharpening filters enhance details. To aggregate from this pool a new hidden state that contains fewer artifacts, we devise a Selective Cross Attention (SCA) module, in which the attention between input features and each hidden state is calculated. Equipped with HSA, our proposed method, FastRealVSR, achieves a 2x speedup over Real-BasicVSR while obtaining better performance. Codes will be available at https://github.com/TencentARC/FastRealVSR
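
As an illustration of the hidden-state pool and selective aggregation described above, the sketch below builds candidate hidden states with cheap fixed filters and mixes them with per-pixel attention weights computed against the input feature. It is a simplified re-statement under assumed tensor shapes and similarity measures, not the FastRealVSR code.

    # Hidden-state pool from cheap filters + attention-weighted aggregation.
    import torch
    import torch.nn.functional as F

    def make_pool(hidden, kernels):
        """Apply each fixed (1, 1, k, k) filter depthwise to hidden (B, C, H, W)."""
        C = hidden.shape[1]
        pool = [hidden]                                  # keep the original state too
        for k in kernels:
            w = k.to(hidden).repeat(C, 1, 1, 1)          # depthwise kernel per channel
            pool.append(F.conv2d(hidden, w, padding=k.shape[-1] // 2, groups=C))
        return torch.stack(pool, dim=1)                  # (B, N, C, H, W)

    def selective_aggregate(feat, pool):
        """feat: (B, C, H, W) input feature; pool: (B, N, C, H, W) candidates."""
        logits = (feat.unsqueeze(1) * pool).sum(dim=2, keepdim=True)  # (B, N, 1, H, W)
        weights = logits.softmax(dim=1)                  # per-pixel mixing of candidates
        return (weights * pool).sum(dim=1)               # aggregated hidden state

    # Example cheap filters: a 3x3 box blur (smooths artifacts) and a
    # sharpening kernel (enhances details).
    blur = torch.full((1, 1, 3, 3), 1.0 / 9.0)
    sharpen = torch.tensor([[0., -1., 0.], [-1., 5., -1.], [0., -1., 0.]]).view(1, 1, 3, 3)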